Abstract: We study the problem of quickly detecting top-k Personalized PageRank lists.
This problem has a number of important applications such as finding local cuts in large
graphs, estimating similarity distance and name disambiguation. In particular, we apply
our results to construct efficient algorithms for the person name disambiguation problem. We
argue that two observations are important when finding top-k Personalized PageRank lists.
Firstly, it is crucial to detect the top-k most important neighbours of a node quickly, while
the exact order in the top-k list and the exact values of PageRank are far less
crucial. Secondly, a small number of wrong elements in a top-k list does not really degrade its
quality, but allowing them can lead to significant computational savings. Based on these two
key observations we propose Monte Carlo methods for fast detection of top-k Personalized
PageRank lists. We provide a performance evaluation of the proposed methods and supply
stopping criteria. Then, we apply the methods to the person name disambiguation problem.
The developed algorithm for the person name disambiguation problem achieved
second place in the WePS 2010 competition.

Key-words: Personalized PageRank, Monte Carlo Methods, Person Name Disambiguation 



* INRIA Sophia Antipolis-Méditerranée, France, K.Avrachenkov@sophia.inria.fr
† University of Twente, The Netherlands, N.Litvak@ewi.utwente.nl
‡ INRIA Sophia Antipolis-Méditerranée, France, Danil.Nemirovsky@gmail.com
§ INRIA Sophia Antipolis-Méditerranée, France, Elena.Smirnova@sophia.inria.fr
¶ INRIA Sophia Antipolis-Méditerranée, France, Marina.Sokol@sophia.inria.fr




Theme COM: Communicating systems
Project-teams Maestro, Axis

Research report no. 7367, August 2010, pages

INRIA Sophia Antipolis research unit
2004, route des Lucioles, BP 93, 06902 Sophia Antipolis Cedex (France)

Telephone: +33 4 92 38 77 77, Fax: +33 4 92 38 77 65



Monte Carlo Methods for Top-k Personalized PageRank Lists
with Application to Name Disambiguation

1 Introduction

Personalized PageRank or Topic-Sensitive PageRank [15] is a generalization of PageRank
[10]. Personalized PageRank is the stationary distribution of a random walk on an entity
graph. With some probability the random walk follows an outgoing link chosen uniformly
at random, and with the complementary probability the random walk jumps to a random
node according to a personalization distribution. Personalized PageRank has a number of
applications. Let us name just a few. In the original paper [15] Personalized PageRank
was used to introduce the personalization in the Web search. In pH [?T] Personalized 
PageRank was suggested for finding related entities. In [55] Green measure, which is closely 
related to Personalized PageRank, was suggested for finding related pages in Wikipedia. 
In [21 [3] Personalized PageRank was used for finding local cuts in graphs and in [3] the 
Personalized PageRank was applied for clustering large hyper-text document collections. In 
many applications we are interested in detecting top-k elements with the largest values of 
Personalized PageRank. Our present work on detecting top-k elements is driven by the 
following two key observations: 

Observation 1: Often it is crucial to detect quickly the top-k elements with the largest
values of the Personalized PageRank, while the exact order in the top-k list and the
exact values of the Personalized PageRank are far less important.

Observation 2: It is not crucial that the top-k list is determined exactly, and therefore we
may apply a relaxation that allows a small number of elements to be placed erroneously in
the top-k list. If the Personalized PageRank values of these elements are of a similar order
of magnitude as those in the top-k list, then such a relaxation does not affect applications, but it
enables us to take advantage of the generic "80/20 rule": 80% of the result is achieved with
20% of the effort.

We argue that the Monte Carlo approach naturally takes into account the two key
observations. In [5] the Monte Carlo approach was proposed for the computation of the standard
PageRank. The estimate of the convergence rate in [5] was very pessimistic. Then, the
implementation of the Monte Carlo approach was improved in [14] and also applied there to
Personalized PageRank. Both [3] and [13] only use end points as information extracted from
the random walk. Moreover, the approach of [14] requires extensive precomputation
and is very demanding in storage. In [S], it has been shown that, to find elements
with large values of PageRank, the Monte Carlo approach requires about the same number
of operations as one iteration of the power iteration method. In the present work we show
that detecting the top-k list of elements, when k is not large, requires an even smaller number of
operations. In our test on the Wikipedia entity graph with about 2 million nodes we
observed that typically a few thousand operations are enough to detect the top-10 list with
just two or three erroneous elements. Namely, to detect a relaxation of the top-10 list we
spend just about 1-5% of the operations required by one power iteration. In the present work
we provide a theoretical justification for such a small number of required operations. We also
apply the Monte Carlo methods for Personalized PageRank to the person name
disambiguation problem. The name resolution problem consists in clustering search results for a person
name according to the namesakes found. We found that considering patterns of Web structure
for the name resolution problem results in methods with very competitive performance.

2 Monte Carlo methods 

Given a directed or undirected graph connecting some entities, the Personalized PageRank
$\pi(s, c)$ with a seed node $s$ and a damping parameter $c$ is defined as a solution of the following
equations

$$\pi(s, c) = c\,\pi(s, c) P + (1 - c)\,\mathbf{1}_s^T, \qquad \sum_{j=1}^{n} \pi_j(s, c) = 1,$$

where $\mathbf{1}_s^T$ is a row unit vector with one in the $s$-th entry and all the other elements equal
to zero, $P$ is the transition matrix associated with the entity graph and $n$ is the number
of entities. Equivalently, the Personalized PageRank can be given by the explicit formula

$$\pi(s, c) = (1 - c)\,\mathbf{1}_s^T [I - cP]^{-1}. \tag{1}$$

Whenever the values of $s$ and $c$ are clear from the context we shall simply write $\pi$.
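
As a point of reference for the Monte Carlo estimators discussed in this section, the defining equation can be solved on a small graph by fixed-point iteration. The Python sketch below is only an illustration under stated assumptions: the graph is a dictionary of out-neighbour lists, every node has at least one out-link (so each row of $P$ sums to one), and the function name and tolerance are hypothetical choices.

```python
def personalized_pagerank(graph, s, c=0.85, tol=1e-12, max_iter=10_000):
    """Iterate pi <- c * pi * P + (1 - c) * 1_s until the update is below tol.

    graph: dict mapping each node to a non-empty list of out-neighbours
    (assumption: no dangling nodes, uniform choice among out-links).
    """
    pi = {v: 0.0 for v in graph}
    pi[s] = 1.0
    for _ in range(max_iter):
        new = {v: 0.0 for v in graph}
        new[s] = 1.0 - c                      # restart mass (1 - c) * 1_s
        for v, out in graph.items():
            share = c * pi[v] / len(out)      # c * pi_v * P_vw for each out-link
            for w in out:
                new[w] += share
        diff = sum(abs(new[v] - pi[v]) for v in graph)
        pi = new
        if diff < tol:
            break
    return pi


# Two nodes linked to each other: pi_0 = (1 - c) / (1 - c^2) = 1 / (1 + c).
exact = personalized_pagerank({0: [1], 1: [0]}, s=0, c=0.85)
```

The iteration contracts at rate $c$ per step, so a few hundred iterations suffice for this tolerance when $c = 0.85$.
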

We would like to note that often the Personalized PageRank is defined with a general
distribution $v$ in place of $\mathbf{1}_s^T$. However, typically the distribution $v$ has a small support. Then,
due to linearity, the problem of Personalized PageRank with distribution $v$ reduces to the
problem of Personalized PageRank with distribution $\mathbf{1}_s^T$ [19].

In this work we consider two Monte Carlo algorithms. The first algorithm is inspired
by the following observation. Consider a random walk $\{X_t\}_{t \ge 0}$ that starts from node $s$,
i.e., $X_0 = s$. Let the random walk at each step terminate with probability $1 - c$ and make
a transition according to the matrix $P$ with probability $c$. Then, the end points of such a
random walk have the distribution $\pi(s, c)$.

Algorithm 2.1 (MC End Point) Simulate $m$ runs of the random walk $\{X_t\}_{t \ge 0}$ initiated
at node $s$. Evaluate $\pi_j$ as the fraction of the $m$ random walks which end at node $j \in \{1, \ldots, n\}$.
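
A minimal Python sketch of Algorithm 2.1 follows. It is an illustration rather than the implementation used for the experiments: the adjacency-list representation, the function name `mc_end_point` and the `seed` parameter are assumptions, and every node is assumed to have at least one out-link.

```python
import random
from collections import Counter


def mc_end_point(graph, s, c=0.85, m=1000, seed=None):
    """Estimate pi(s, c) by the fraction of m walks that end at each node.

    graph: dict mapping each node to a non-empty list of out-neighbours
    (assumption: no dangling nodes).
    """
    rng = random.Random(seed)
    ends = Counter()
    for _ in range(m):
        node = s                              # X_0 = s
        # At each step: continue with probability c, terminate with 1 - c.
        while rng.random() < c:
            node = rng.choice(graph[node])    # uniform out-link, i.e. a row of P
        ends[node] += 1                       # record only the end point
    return {v: ends[v] / m for v in graph}


estimate = mc_end_point({0: [1], 1: [0]}, s=0, c=0.85, m=20000, seed=1)
```

On this two-node cycle the exact value is $\pi_0 = 1/(1+c)$, which gives a quick sanity check of the estimate.
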

The next observation leads to another Monte Carlo algorithm for Personalized PageRank.
Denote $Z := [I - cP]^{-1}$. We have the following interpretation for the elements of matrix $Z$:
$Z_{sj} = E_s[N_j]$, where $N_j$ is the number of visits to node $j$ by a random walk before a restart,
and $E_s[\cdot]$ is the expectation assuming that the random walk started at node $s$. Namely, $Z_{sj}$
is the expected number of visits to node $j$ by the random walk initiated at state $s$ with the
run time geometrically distributed with parameter $c$. Thus, the formula (1) suggests the
following estimator for Personalized PageRank

$$\hat\pi_j(s, c) = (1 - c)\,\frac{1}{m} \sum_{r=1}^{m} N_j(s, r), \tag{2}$$

where $N_j(s, r)$ is the number of visits to state $j$ during the run $r$ of the random walk initiated
at node $s$. Thus, we can suggest the second Monte Carlo algorithm.

Algorithm 2.2 (MC Complete Path) Simulate $m$ runs of the random walk $\{X_t\}_{t \ge 0}$
initiated at node $s$. Evaluate $\pi_j$ as the total number of visits to node $j$ multiplied by $(1 - c)/m$.
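
Algorithm 2.2 can be sketched in the same way. The code below is a hedged illustration: the graph representation, the name `mc_complete_path` and the `seed` parameter are assumptions, and the starting node is counted as a visit so that the visit counts match the interpretation of $Z_{sj}$ given above.

```python
import random
from collections import Counter


def mc_complete_path(graph, s, c=0.85, m=1000, seed=None):
    """Estimate pi(s, c) from all visits made by m terminating random walks.

    graph: dict mapping each node to a non-empty list of out-neighbours
    (assumption: no dangling nodes).
    """
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(m):
        node = s
        visits[node] += 1                     # X_0 = s counts as a visit
        while rng.random() < c:               # continue with probability c
            node = rng.choice(graph[node])
            visits[node] += 1
    # Estimator (2): total visits to j multiplied by (1 - c) / m.
    return {v: (1.0 - c) * visits[v] / m for v in graph}


estimate = mc_complete_path({0: [1], 1: [0]}, s=0, c=0.85, m=20000, seed=1)
```

Each run contributes on average $1/(1-c)$ visits instead of a single end point, which is the source of the variance reduction quantified in Section 3.
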

As outputs of the proposed algorithms we would like to obtain with high probability 
either a top-k list of nodes or a top-k basket of nodes. 

Definition 2.1 The top-k list of nodes is a list of k nodes with the largest Personalized
PageRank values arranged in descending order of their Personalized PageRank values.

Definition 2.2 The top-k basket of nodes is a set of k nodes with the largest Personalized
PageRank values with no ordering required.

It turns out that it is beneficial to relax our goal and to obtain a top-k basket with a
small number of erroneous elements.

Definition 2.3 We call relaxation-l top-k basket a realization in which we allow at most l
erroneous elements in the top-k basket.
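
The three definitions translate directly into a check that can be applied to the output of either algorithm. The following Python sketch is illustrative; the function names and the tie-breaking rule (ties broken by node label) are assumptions that the definitions do not fix.

```python
def top_k_basket(scores, k):
    """Set of the k nodes with the largest scores (ties broken by node label)."""
    ranked = sorted(scores, key=lambda v: (-scores[v], v))
    return set(ranked[:k])


def is_relaxation_l_basket(estimated, exact, k, l):
    """True if the estimated top-k basket has at most l erroneous elements."""
    wrong = top_k_basket(estimated, k) - top_k_basket(exact, k)
    return len(wrong) <= l


# Hypothetical scores for four nodes.
exact = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
estimated = {"a": 0.5, "b": 0.1, "c": 0.15, "d": 0.25}
ok = is_relaxation_l_basket(estimated, exact, k=2, l=1)
```
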

In the present work we aim to estimate the number of random walk runs m sufficient
for obtaining the top-k list, the top-k basket or the relaxation-l top-k basket with high probability.
In particular, we demonstrate that the ranking converges considerably faster than the values of
Personalized PageRank and that a relaxation-l with quite small l helps significantly.

Let us begin the analysis of the algorithms with the help of an illustrating example on the
Wikipedia entity graph. We shall develop this example throughout the
paper. There are a number of reasons why we have chosen the Wikipedia entity graph. Firstly,
the Wikipedia entity graph is a non-trivial example of a complex network. Secondly, it has
been shown that the Green's measure, which is closely related to Personalized PageRank, is
a good measure of similarity on the Wikipedia entity graph [55]. In addition, we note that
Personalized PageRank is a good similarity measure also in social networks [521 and on the
Web l^ni . Thirdly, we apply our person name disambiguation algorithm on the real Web, for
which we cannot compute the real values of the Personalized PageRank. In contrast, the Personalized
PageRank can be computed with high precision for the Wikipedia entity graph with the
help of the BVGraph/WebGraph framework [7].

Illustrating example: Since our work is concerned with the application of Personalized
PageRank to the name disambiguation problem, let us choose a common name. One of the most
common English names is Jackson. We have selected three Jacksons who have entries in
Wikipedia: Jim Jackson (ice hockey), Jim Jackson (sportscaster) and Michael Jackson. Two
of the Jacksons even have a common given name and both worked in ice hockey, one as an ice hockey
player and the other as an ice hockey sportscaster. In Tables 1-3 we provide the exact lists of
top-10 Wikipedia articles arranged according to Personalized PageRank vectors. In Table 1
the seed node for the Personalized PageRank is the article Jim Jackson (ice hockey), in
Table 2 the seed node is the article Jim Jackson (sportscaster), and in Table 3 the seed
node is the article Michael Jackson. We observe that each top-10 list identifies its seed node
quite well. This gives us hope that Personalized PageRank can be useful in the name
disambiguation problem. (We shall discuss the name disambiguation problem further in
Section [S].) Next we run the Monte Carlo End Point method starting from each seed node. We
note that the top-10 lists obtained by Monte Carlo methods also identify the original seed
nodes well. It is interesting to note that to obtain a relaxed top-10 list with two or three
erroneous elements we need a different number of runs for different seed nodes. To obtain a good
relaxed top-10 list for Michael Jackson we need to perform about 50000 runs, whereas for
a good relaxed top-10 list for Jim Jackson (ice hockey) we need to make just 500 runs.
Intuitively, the more immediate neighbours a node has, the larger the number of Monte Carlo
steps required. Starting from a node with many immediate neighbours, the Monte Carlo
method easily drifts away. In Figures 1-3 we present examples of typical runs of the Monte
Carlo End Point method for the three different seed nodes. An example of the Monte Carlo
Complete Path method for the seed node Michael Jackson is given in Figure 4. Indeed, as
expected, it outperforms the Monte Carlo End Point method. In the following sections we
shall quantify all the above qualitative observations.



Table 1: Top-10 lists for Jim Jackson (ice hockey)

No.  Exact Top-10 List          MC End Point (m=500)
1    Jim Jackson (ice hockey)   Jim Jackson (ice hockey)
2    Ice hockey                 Winger (ice hockey)
3    National Hockey League     1960
4    Buffalo Sabres             National Hockey League
5    Winger (ice hockey)        Ice hockey
6    Calgary Flames             February 1
7    Oshawa                     Buffalo Sabres
8    February 1                 Oshawa
9    1960                       Calgary Flames
10   Ice hockey rink            Columbus Blue Jackets


3 Variance based performance comparison and CLT approximations

In the MC End Point algorithm the distribution of end points is multinomial [20]. Namely,
if we denote by $L_j$ the number of paths that end at node $j$ after $m$ runs, then we have

$$P\{L_1 = l_1, L_2 = l_2, \ldots, L_n = l_n\} = \frac{m!}{l_1!\, l_2! \cdots l_n!}\, \pi_1^{l_1} \pi_2^{l_2} \cdots \pi_n^{l_n}. \tag{3}$$

Figure 1: The number of correctly detected elements by MC End Point for the seed node
Jim Jackson (ice hockey).

Figure 2: The number of correctly detected elements by MC End Point for the seed node
Jim Jackson (sportscaster).

Figure 3: The number of correctly detected elements by MC End Point for the seed node
Michael Jackson.

Figure 4: The number of correctly detected elements by MC Complete Path for the seed
node Michael Jackson.

Table 2: Top-10 lists for Jim Jackson (sportscaster)

No.  Exact Top-10 List            MC End Point (m=5000)
1    Jim Jackson (sportscaster)   Jim Jackson (sportscaster)
2    Philadelphia Flyers          Steve Coates
3    United States                New York
4    Philadelphia Phillies        United States
5    Sportscaster                 Philadelphia Flyers
6    Eastern League (baseball)    Gene Hart
7    New Jersey Devils            Sportscaster
8    New York - Penn League       New Jersey Devils
9    Play-by-play                 Mike Emrick
10   New York                     New York - Penn League

Table 3: Top-10 lists for Michael Jackson

No.  Exact Top-10 List      MC End Point (m=50000)
1    Michael Jackson        Michael Jackson
2    United States          United States
3    Billboard Hot 100      Pop music
4    The Jackson 5          Epic Records
5    Pop music              Billboard Hot 100
6    Epic Records           Motown Records
7    Motown Records         The Jackson 5
8    Soul music             Singing
9    Billboard (magazine)   Hip Hop music
10   Singing                Gary, Indiana


Thus, the standard deviation of the MC End Point estimator for the $k$-th element is given by

$$\sigma(\hat\pi_k) = \sigma(L_k/m) = \frac{1}{\sqrt{m}} \sqrt{\pi_k (1 - \pi_k)}. \tag{4}$$

An expression for the standard deviation of the MC Complete Path estimator is more complicated.
From (2), it follows that

$$\sigma(\hat\pi_k) = \frac{1 - c}{\sqrt{m}}\, \sigma(N_k) = \frac{1 - c}{\sqrt{m}} \sqrt{E_s\{N_k^2\} - E_s\{N_k\}^2}. \tag{5}$$

First, we recall that

$$E_s\{N_k\} = Z_{sk} = \pi_k(s)/(1 - c). \tag{6}$$

Then, from [21], it is known that the second moment of $N_k$ is given by

$$E_s\{N_k^2\} = [Z(2Z_{dg} - I)]_{sk},$$

where $Z_{dg}$ is a diagonal matrix having as its diagonal the diagonal of matrix $Z$ and $[A]_{ik}$
denotes the $(i, k)$-th element of matrix $A$. Thus, we can write

$$E_s\{N_k^2\} = \mathbf{1}_s^T Z (2Z_{dg} - I) \mathbf{1}_k = \frac{1}{1 - c}\, \pi(s) (2Z_{dg} - I) \mathbf{1}_k
= \frac{1}{1 - c} \left( \frac{2}{1 - c}\, \pi_k(s) \pi_k(k) - \pi_k(s) \right). \tag{7}$$

Substituting (6) and (7) into (5), we obtain

$$\sigma(\hat\pi_k) = \frac{1}{\sqrt{m}} \sqrt{\pi_k(s) \big( 2\pi_k(k) - (1 - c) - \pi_k(s) \big)}. \tag{8}$$

Since $\pi_k(k) \approx 1 - c$, we can approximate $\sigma(\hat\pi_k)$ with

$$\sigma(\hat\pi_k) \approx \frac{1}{\sqrt{m}} \sqrt{\pi_k(s) \big( (1 - c) - \pi_k(s) \big)}.$$

Comparing the latter expression with (4), we can see that MC End Point requires
approximately $1/(1 - c)$ times as many runs as MC Complete Path to reach the same variance.
This was expected, as MC End Point uses only information from the end points of the random walks.
We would like to emphasize that $1/(1 - c)$ can be a significant coefficient. For instance, if
$c = 0.85$, then $1/(1 - c) \approx 6.7$.
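
The comparison between (4) and (8) can be checked numerically. The short Python sketch below evaluates both standard deviations for hypothetical values of $\pi_k(s)$ and $\pi_k(k)$ (chosen only for illustration, with $\pi_k(k)$ set to $1 - c$); the function names are assumptions.

```python
import math


def sd_end_point(pi_k, m):
    """Standard deviation (4) of the MC End Point estimate of pi_k."""
    return math.sqrt(pi_k * (1.0 - pi_k) / m)


def sd_complete_path(pi_k_s, pi_k_k, c, m):
    """Standard deviation (8) of the MC Complete Path estimate of pi_k(s).

    pi_k_s: k-th element of pi with seed s; pi_k_k: k-th element with seed k.
    """
    return math.sqrt(pi_k_s * (2.0 * pi_k_k - (1.0 - c) - pi_k_s) / m)


# Hypothetical values: c = 0.85, pi_k(s) = 0.01, pi_k(k) = 1 - c = 0.15.
c, m = 0.85, 1000
ratio = sd_end_point(0.01, m) / sd_complete_path(0.01, 0.15, c, m)
variance_ratio = ratio ** 2   # close to 1 / (1 - c) when pi_k(s) is small
```
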
Let us provide central limit type theorems for our estimators. 

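Theorem 3.1 For large $m$, a multivariate normal density approximation to the multinomial
distribution (3) is given by

$$f(l_1, l_2, \ldots, l_n) = \left( \frac{1}{2\pi m} \right)^{(n-1)/2} \left( \frac{1}{n \pi_1 \pi_2 \cdots \pi_n} \right)^{1/2}
\exp\left\{ -\frac{1}{2} \sum_{i=1}^{n} \frac{(l_i - m\pi_i)^2}{m\pi_i} \right\}, \tag{9}$$

subject to $\sum_{i=1}^{n} l_i = m$.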

Proof. See US] and [20] ■ 

Now we consider MC Complete Path. First, we note that the vectors
$N(s, r) = (N_1(s, r), \ldots, N_n(s, r))$ with $r = 1, 2, \ldots$ form a sequence of i.i.d. random vectors.
Hence, we can apply the multivariate central limit theorem. Denote

$$\hat N(s, m) = \frac{1}{m} \sum_{r=1}^{m} N(s, r). \tag{10}$$

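Theorem 3.2 Let $m$ go to infinity. Then, we have the following convergence in distribution
to a multivariate normal distribution

$$\sqrt{m}\, \big( \hat N(s, m) - \bar N(s) \big) \xrightarrow{D} \mathcal{N}(0, \Sigma(s)),$$

where $\bar N(s) = \mathbf{1}_s^T Z$ and $\Sigma(s) = E\{N^T(s, r) N(s, r)\} - \bar N^T(s) \bar N(s)$ is a covariance matrix,
which can be expressed as

$$\Sigma(s) = \Omega(s) Z + Z^T \Omega(s) - \Omega(s) - Z^T \mathbf{1}_s \mathbf{1}_s^T Z, \tag{11}$$

where the matrix $\Omega(s) = \{\omega_{jk}(s)\}$ is defined by

$$\omega_{jk}(s) = \begin{cases} z_{sj}, & \text{if } j = k, \\ 0, & \text{otherwise.} \end{cases}$$
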



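Proof. The convergence follows from the standard multivariate central limit theorem. We
only need to establish the formula for the covariance matrix.
The covariance matrix can be expressed as follows [27]:

$$\Sigma(s) = \sum_{j=1}^{n} z_{sj} \big( D(j) Z + Z^T D(j) - D(j) \big) - Z^T \mathbf{1}_s \mathbf{1}_s^T Z, \tag{12}$$

where $D(j) = \{d_{kl}(j)\}$ is defined by

$$d_{kl}(j) = \begin{cases} 1, & \text{if } k = l = j, \\ 0, & \text{otherwise.} \end{cases}$$

Let us consider $\sum_{j=1}^{n} z_{sj} D(j) Z$ in component form:

$$\sum_{j=1}^{n} z_{sj} \sum_{\varphi=1}^{n} d_{l\varphi}(j) z_{\varphi k} = \sum_{j=1}^{n} z_{sj} \delta_{lj} z_{jk} = z_{sl} z_{lk} = \sum_{j=1}^{n} \omega_{lj}(s) z_{jk},$$

which implies that $\sum_{j=1}^{n} z_{sj} D(j) Z = \Omega(s) Z$. Symmetrically, $\sum_{j=1}^{n} z_{sj} Z^T D(j) = Z^T \Omega(s)$.
The equality $\sum_{j=1}^{n} z_{sj} D(j) = \Omega(s)$ can be easily established. This completes the proof. □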

We would like to note that in both cases we obtain convergence to rank deficient
(singular) multivariate normal distributions.

Of course, one can use the joint confidence intervals for the CLT approximations to
estimate the quality of the top-k list or basket. However, it appears that we can propose more
efficient methods. Let us consider as an example the mutual ranking of two elements $k$ and $l$
from a list. For illustration purposes, assume that the elements are independent and have
the same variance. Suppose that we apply some version of the CLT approximation. Then, we
need to compare two normal random variables $Y_k$ and $Y_l$ with means $\pi_k$ and $\pi_l$, and with
the same variance $\sigma^2$. Without loss of generality we assume that $\pi_k > \pi_l$. Then, it can be
shown that one needs twice as many experiments to guarantee that the random variables $Y_k$
and $Y_l$ are inside their confidence intervals with confidence level $\alpha$ as to guarantee that
$P\{Y_k > Y_l\} = \alpha$. Thus, it is more beneficial to look at the order of the elements rather than
their absolute values. We shall pursue this idea in more detail in the ensuing sections.



4 Convergence based on order

For the two introduced Monte Carlo methods we would like to calculate or to estimate
the probability that after a given number of steps we correctly obtain the top-k list or the top-k
basket. Namely, we need to calculate the probabilities $P\{L_1 > \cdots > L_k > L_j, \forall j > k\}$ and
$P\{L_i > L_j, \forall i, j : i \le k < j\}$ respectively, where $L_k$, $k \in \{1, \ldots, n\}$, can be either the Monte
Carlo estimates of the ranked elements or their CLT approximations. We refer to these
probabilities as the ranking probabilities and we refer to the complementary probabilities as
misranking probabilities [5]. Even though these probabilities are easy to define, it turns out
that because of combinatorial explosion their exact calculation is infeasible for non-trivial
cases.

We first propose to estimate the ranking probabilities of the top-k list and the top-k basket with
the help of the Bonferroni inequality p^. This approach works for reasonably large values of
m.

4.1 Estimation by Bonferroni inequality

Drawing the top-k basket correctly is defined by the event

$$\bigcap_{i \le k < j} \{L_i > L_j\}.$$

Let us apply the Bonferroni inequality to this event. We obtain

$$P\Big\{ \bigcap_{i \le k < j} \{L_i > L_j\} \Big\} \ge 1 - \sum_{i \le k < j} P\big\{ \overline{\{L_i > L_j\}} \big\}.$$

Equivalently, we can write the following upper bound for the misranking probability

$$1 - P\Big\{ \bigcap_{i \le k < j} \{L_i > L_j\} \Big\} \le \sum_{i \le k < j} P\{L_i \le L_j\}. \tag{13}$$

We note that it is very good that we obtain an upper bound in the above expression for the
misranking probability, since the upper bound provides a guarantee on the performance
of our algorithms. Since in the MC End Point method the distribution of end points is
multinomial (see (3)), the probability $P\{L_i \le L_j\}$ is given by

$$P\{L_i \le L_j\} = \sum_{l_i + l_j \le m,\; l_i \le l_j} \frac{m!}{l_i!\, l_j!\, (m - l_i - l_j)!}\, \pi_i^{l_i} \pi_j^{l_j} (1 - \pi_i - \pi_j)^{m - l_i - l_j}. \tag{14}$$
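
For small $m$ the pairwise misranking probability of MC End Point can be evaluated directly from the trinomial marginal of the multinomial distribution (3). The Python sketch below is an illustration with a hypothetical function name; it sums over all pairs $(l_i, l_j)$ with $l_i \le l_j$ and $l_i + l_j \le m$, and its cost grows quadratically in $m$.

```python
from math import comb


def misranking_exact(pi_i, pi_j, m):
    """P{L_i <= L_j} after m End Point runs, by direct summation.

    pi_i, pi_j: end-point probabilities of nodes i and j (pi_i + pi_j <= 1).
    """
    rest = 1.0 - pi_i - pi_j                  # probability of any other node
    total = 0.0
    for li in range(m + 1):
        for lj in range(li, m - li + 1):      # l_i <= l_j and l_i + l_j <= m
            ways = comb(m, li) * comb(m - li, lj)   # m! / (l_i! l_j! (m-l_i-l_j)!)
            total += ways * pi_i**li * pi_j**lj * rest**(m - li - lj)
    return total


# Hypothetical probabilities 0.5 and 0.3.
p10 = misranking_exact(0.5, 0.3, 10)
p50 = misranking_exact(0.5, 0.3, 50)
```
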

The above formula can only be used for small values of $m$. For large values of $m$, we can
use the CLT approximation for both MC methods. To distinguish between the original
number of hits and its CLT approximation, we use $L_j$ for the original number of hits at
node $j$ and $Y_j$ for its CLT approximation. First, we obtain a CLT based expression for the
misranking probability for two nodes, $P\{Y_i \le Y_j\}$. Since the event $\{Y_i \le Y_j\}$ coincides with
the event $\{Y_i - Y_j \le 0\}$ and a difference of two normal random variables is again a normal
random variable, we obtain

$$P\{Y_i \le Y_j\} = P\{Y_i - Y_j \le 0\} = 1 - \Phi(\sqrt{m}\, \rho_{ij}),$$

where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal random variable
and

$$\rho_{ij} = \frac{E[Y_i] - E[Y_j]}{\sqrt{\sigma^2(Y_i) - 2\,\mathrm{cov}(Y_i, Y_j) + \sigma^2(Y_j)}}.$$

For large $m$, the above expression can be bounded by

$$1 - \Phi(\sqrt{m}\, \rho_{ij}) \le \frac{1}{\sqrt{2\pi}}\, e^{-\rho_{ij}^2 m / 2}.$$

Since the misranking probability for two nodes $P\{Y_i \le Y_j\}$ decreases when $j$ increases, we
can write

$$1 - P\Big\{ \bigcap_{i \le k < j} \{Y_i > Y_j\} \Big\} \le \sum_{i=1}^{k} \left( \sum_{j=k+1}^{j^*} P\{Y_i \le Y_j\} + \sum_{j=j^*+1}^{n} P\{Y_i \le Y_{j^*}\} \right),$$

for some $j^*$. This gives the following upper bound

$$1 - P\Big\{ \bigcap_{i \le k < j} \{Y_i > Y_j\} \Big\} \le \sum_{i=1}^{k} \sum_{j=k+1}^{j^*} \big( 1 - \Phi(\sqrt{m}\, \rho_{ij}) \big) + \frac{n - j^*}{\sqrt{2\pi}} \sum_{i=1}^{k} e^{-\rho_{ij^*}^2 m / 2}. \tag{15}$$

Since we have a finite number of terms in the right hand side of expression (15), we
conclude that


K. Avrachenkov, N. Litvak, D. Nemirovsky, E. Smirnova & M. Sokol 



Theorem 4.1 The misranking probability of the top-k basket tends to zero with geometric
rate, that is,

$$1 - P\Big\{ \bigcap_{i \le k < j} \{Y_i > Y_j\} \Big\} \le C a^m,$$

for some $C > 0$ and $a \in (0, 1)$.

We note that $\rho_{ij}$ has a simple expression in the case of the multinomial distribution

$$\rho_{ij} = \frac{\pi_i - \pi_j}{\sqrt{\pi_i (1 - \pi_i) + 2 \pi_i \pi_j + \pi_j (1 - \pi_j)}}.$$

For MC Complete Path, $\sigma^2(Y_i) = \Sigma_{ii}(s)$ and $\mathrm{cov}(Y_i, Y_j) = \Sigma_{ij}(s)$, where $\Sigma_{ii}(s)$ and $\Sigma_{ij}(s)$
can be calculated by (11).
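
For MC End Point the Bonferroni bound with CLT-approximated pairwise terms depends only on the Personalized PageRank values. The Python sketch below sums $1 - \Phi(\sqrt{m}\,\rho_{ij})$ over all pairs with $i \le k < j$, using the multinomial expression for $\rho_{ij}$. The function names and the example vector are hypothetical, and the input is assumed to be sorted in decreasing order.

```python
import math


def upper_normal_tail(x):
    """1 - Phi(x) for the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def rho_multinomial(pi_i, pi_j):
    """rho_ij for MC End Point (multinomial case)."""
    var = pi_i * (1.0 - pi_i) + 2.0 * pi_i * pi_j + pi_j * (1.0 - pi_j)
    return (pi_i - pi_j) / math.sqrt(var)


def bonferroni_basket_bound(pi, k, m):
    """Upper bound on the top-k basket misranking probability after m runs.

    pi: Personalized PageRank values sorted in decreasing order (assumption).
    """
    bound = 0.0
    for i in range(k):
        for j in range(k, len(pi)):
            bound += upper_normal_tail(math.sqrt(m) * rho_multinomial(pi[i], pi[j]))
    return min(1.0, bound)                    # a probability bound above 1 is vacuous


pi = [0.4, 0.3, 0.2, 0.1]                     # hypothetical values
bounds = [bonferroni_basket_bound(pi, k=2, m=m) for m in (100, 1000)]
```
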

The Bonferroni inequality for the top-k list gives

$$P\{Y_1 > \cdots > Y_k > Y_j, \forall j > k\} \ge 1 - \sum_{1 \le i \le k-1} P\{Y_i \le Y_{i+1}\} - \sum_{k+1 \le j \le n} P\{Y_k \le Y_j\}.$$

Using the misranking probability for two elements, one can obtain more informative bounds for
the top-k list, as was done above for the case of the top-k basket. For the misranking probability
of the top-k list we also have a geometric rate of convergence.

Illustrating example (cont.): In Figure 5 we plot the Bonferroni bound for the misranking
probability given by (13) with the CLT approximation for the pairwise misranking
probability. We note that the Bonferroni inequality provides a quite conservative estimate of the
necessary number of MC runs. Below we shall try to obtain a better estimate.

4.2 Approximation based on order statistics

We can obtain more insight into the convergence based on order with the help of order
statistics [13].

Let us denote by $X_t \in \{1, \ldots, n\}$ the node hit by the random walk $t$, and let $X$ have the
same distribution as $X_t$. Now let us consider the $s$-th order statistic of the random variables
$X_t$, $t = 1, \ldots, m$. We can calculate its cumulative distribution function as follows:

$$P\{X_{(s)} \le k\} = \sum_{j=0}^{m-s} \binom{m}{j} P\{X > k\}^{j}\, P\{X \le k\}^{m-j}
= 1 - \sum_{j=0}^{s-1} \binom{m}{j} P\{X \le k\}^{j} \big( 1 - P\{X \le k\} \big)^{m-j}. \tag{16}$$

Figure 5: Bonferroni bound for MC End Point for the seed node Jim Jackson (ice
hockey) and top-9 basket.

It is interesting to observe that $P\{X_{(s)} \le k\}$ depends on the Personalized PageRank
distribution only via $P\{X \le k\} = \pi_1 + \ldots + \pi_k$, showing an insensitivity property with respect
to the distribution's tail.

Next we notice that a reasonable minimal value of $m$ corresponds to the case when the
elements of the top-k basket obtain $r$ or more hits with a high probability and the other
elements outside the top-k basket have a very small probability of being hit $r$ times. Thus, the
probability $P\{X_{(rk)} \le k\}$ should be reasonably high and the probability of hitting $r$ times
the elements outside the top-k basket should be small. The probability to hit the element $j$
at least $r$ times is given by

$$P\{Y_j \ge r\} = 1 - \sum_{l=0}^{r-1} e^{-m\pi_j} \frac{(m\pi_j)^l}{l!}. \tag{17}$$

Hence, choosing $m$ for the fast detection of the top-k basket we need to satisfy two criteria:
(i) $P\{X_{(rk)} \le k\} \ge 1 - \varepsilon_1$, and (ii) $P\{Y_j \ge r\} \le \varepsilon_2$ for $j > k$. The probability in (i) and
$P\{Y_j \ge r\}$ in (ii) are both increasing with $m$. However, we have observed (see the illustrating
example next) that for a given $m$ the probabilities given in (16) and (17) drop drastically
with $r$. Thus, we hope to be able to find a proper balance between $m$ and $r$ for a reasonably
small value of $r$.

We can further improve the computational efficiency for the order statistics distribution (16)
with the help of the incomplete Beta function as suggested in [1]. Namely, in our case we have

$$P\{X_{(s)} \le k\} = I_{P\{X \le k\}}(s, m - s + 1), \tag{18}$$

where

$$I_x(a, b) = \frac{1}{B(a, b)} \int_0^x t^{a-1} (1 - t)^{b-1}\, dt$$

is the incomplete Beta function.

Figure 6: Evaluations based on order statistics for the seed node Jim Jackson (ice
hockey): $P\{X_{(rk)} \le k\}$ (solid line, "Top-9 basket") and $P\{L_j \ge r\}$ (dash line, "10th element"),
k = 9, j = 10, r = 5.
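
Both criteria can be evaluated with elementary functions. The Python sketch below computes the order statistic distribution (16) as a binomial tail and the hit probability (17) as a Poisson tail, working with logarithms to avoid overflow for large $m$. The function names and the input values in the usage lines are hypothetical; the Personalized PageRank values of the running example are not reproduced here.

```python
import math


def order_statistic_cdf(m, s, p):
    """P{X_(s) <= k}, where p = P{X <= k}: at least s of m walks end in the top-k."""
    if p <= 0.0:
        return 1.0 if s <= 0 else 0.0
    if p >= 1.0:
        return 1.0
    below = 0.0
    for j in range(min(s, m + 1)):            # j = 0, ..., s - 1 hits inside the top-k
        log_term = (math.lgamma(m + 1) - math.lgamma(j + 1) - math.lgamma(m - j + 1)
                    + j * math.log(p) + (m - j) * math.log(1.0 - p))
        below += math.exp(log_term)
    return max(0.0, 1.0 - below)


def poisson_tail(lam, r):
    """P{Y >= r} for a Poisson variable Y with mean lam = m * pi_j."""
    if lam <= 0.0:
        return 1.0 if r <= 0 else 0.0
    below = sum(math.exp(-lam + l * math.log(lam) - math.lgamma(l + 1))
                for l in range(r))
    return max(0.0, 1.0 - below)


# Hypothetical inputs: mass 0.5 on the top-9 basket, pi_10 = 0.002, m = 250, r = 5.
criterion_i = order_statistic_cdf(250, 5 * 9, 0.5)
criterion_ii = poisson_tail(250 * 0.002, 5)
```

A candidate $m$ is acceptable when `criterion_i` is at least $1 - \varepsilon_1$ and `criterion_ii` is at most $\varepsilon_2$ for the chosen tolerances.
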

Illustrating example (cont.): We first consider the seed node Jim Jackson (ice hockey).
In Figure 6 we plot the probabilities given by (16) and (17) for $m \le 2000$, $r = 5$, and $k = 9$.
For instance, if we take $m = 250$, $P\{X_{(45)} \le 9\} = 0.9999$ and $P\{Y_{10} \ge 5\}$ = 4.29 x lO"'"^.
Thus, with very high probability we collect 45 hits inside the top-9 basket and the
probability for the 10-th element to receive 5 or more hits is very small. Figure 1
confirms that taking $m = 250$ is more than enough to detect the top-9 basket. Suppose now
that we want to detect the top-10 basket. Then, Figure 7, corresponding to $m \le 10000$,
$r = 18$ and $k = 10$, suggests that to obtain the top-10 basket correctly with high probability
we need to spend about four times as many operations as for the top-9 basket.
Here we already see an illustration of the "80/20 rule", which we discuss further in the next
section. Now let us consider the seed node Michael Jackson. In Figure 8 we plot the
probabilities given by (16) and (17) for $m \le 100000$, $r = 57$, $k = 10$ and $j = 11$. We
have $P\{X_{(570)} \le 10\} = 0.9717$ and $P\{Y_{11} \ge 57\} = 0.2338$. Even though there is a
significant chance to get some erroneous elements in the top-10 list, as Figure 9 suggests we
get "high quality" erroneous elements. Specifically, we have $P\{Y_{100} \ge 57\} = 0.0077$ and
$P\{Y_{500} \ge 57\}$ = 8.8 X 10-4^.



5 Solution relaxation

In this section we analytically evaluate the average number of correctly identified top-k
nodes. We use the relaxation by allowing this number to be smaller than k. Our goal is
to provide mathematical evidence for the observed "80/20 behavior" of the algorithm: 80
percent of the top-k nodes are identified correctly in a very short time. Accordingly, we
evaluate the number of experiments m for obtaining high quality top-k lists.

Figure 7: Evaluations based on order statistics for the seed node Jim Jackson (ice
hockey): $P\{X_{(rk)} \le k\}$ (solid line) and $P\{L_j \ge r\}$ (dash line), k = 10, j = 11, r = 18.

Figure 8: Evaluations based on order statistics for the seed node Michael Jackson:
$P\{X_{(rk)} \le k\}$ (solid line) and $P\{L_j \ge r\}$ (dash line), k = 10, j = 11, r = 57.

Figure 9: Evaluations based on order statistics for the seed node Michael Jackson:
$P\{X_{(rk)} \le k\}$ (solid line) and $P\{L_j \ge r\}$ (dash line), k = 10, j = 100, r = 57.

Let $M_0$ be the number of correctly identified elements in the top-k basket. In addition,
denote by $K_i$ the number of nodes ranked not lower than $i$. Formally,

$$K_i = \sum_{j \ne i} 1\{L_j \ge L_i\}, \qquad i = 1, \ldots, k.$$

Clearly, placing node $i$ in the top-k basket is equivalent to the event $[K_i < k]$, and thus we
obtain

$$E(M_0) = E\left( \sum_{i=1}^{k} 1\{K_i < k\} \right) = \sum_{i=1}^{k} P(K_i < k). \tag{19}$$

To evaluate $E(M_0)$ by (19) we need to compute the probabilities $P(K_i < k)$ for
$i = 1, \ldots, k$. Direct evaluation of these probabilities is computationally intractable. A Markov
chain approach based on the representations from [12] is more efficient, but this method, too,
resulted in extremely demanding numerical schemes in realistic scenarios. Thus, to
characterize the algorithm performance, we suggest using two simplification steps: approximation
and Poissonisation.

Poissonisation is a common technique for analyzing occupancy measures [16]. Clearly,
the End Point algorithm is nothing else but an occupancy scheme where each independent
experiment (random walk) results in placing one ball (visit) into an urn (node of the graph).
Under Poissonisation, we assume that the number of random walks is not a fixed value $m$
but a Poisson random variable $M$ with mean $m$. In this scenario, the number $Y_j$ of visits
to page $j$ has a Poisson distribution with parameter $m\pi_j$ and is independent of $Y_i$ for $i \ne j$.
Because the number of hits in the Poissonised model is different from the number of original
hits, we use the notation $Y_j$ instead of $L_j$. Poissonisation simplifies the analysis considerably
due to the imposed independence of the $Y_j$'s.

In addition to Poissonisation, we approximate $M_0$ by a closely related measure $M_1$: 

$$M_1 := k - \sum_{i=1}^{k} (K'_i / k),$$

where $K'_i$ denotes the number of pages outside the top-k that are ranked higher than node 
$i = 1, \ldots, k$. The idea behind $M_1$ is as follows: $K'_i$ is the number of mistakes with respect 
to node $i$ that lead to errors in the identified top-k list. The sum in the definition of $M_1$ is 
the average number of such mistakes with respect to each of the top-k nodes. 

Two properties of $M_1$ make it more tractable than $M_0$. First, the average value of $M_1$ is 

$$E(M_1) = k - \frac{1}{k} \sum_{i=1}^{k} E(K'_i),$$

which involves only the average values of $K'_i$ and not their distributions. Second, $K'_i$ involves 
only the nodes outside of the top-k for each $i = 1, \ldots, k$, and thus we can make use of the 
following convenient measure $\mu(y)$: 

$$\mu(y) := E(K'_i \mid Y_i = y) = \sum_{j=k+1}^{n} P(Y_j \ge y), \quad i = 1, \ldots, k,$$

which implies 

$$E(K'_i) = \sum_{y=0}^{\infty} P(Y_i = y)\, \mu(y), \quad i = 1, \ldots, k.$$

Therefore, we obtain the following expression for $E(M_1)$: 

$$E(M_1) = k - \frac{1}{k} \sum_{y=0}^{\infty} \mu(y) \sum_{i=1}^{k} P(Y_i = y). \qquad (20)$$
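
Formula (20) can be evaluated numerically once the Personalized PageRank values are known. The following is a minimal sketch under stated assumptions: the name `expected_m1`, the truncation point of the sum over y, and the toy PageRank vector are all hypothetical and are not taken from the experiments reported here.

```python
import math

def poisson_pmf(y, lam):
    """P(Y = y) for Y ~ Poisson(lam), computed in log space for stability."""
    if lam == 0.0:
        return 1.0 if y == 0 else 0.0
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))

def expected_m1(pi, k, m):
    """Evaluate E(M_1) by formula (20) for PPR values `pi` and mean walk count m."""
    pi = sorted(pi, reverse=True)           # nodes 1..k are the true top-k
    top, rest = pi[:k], pi[k:]
    lam_max = m * top[0]
    # Truncate the sum over y well beyond the largest Poisson mean (assumption).
    y_max = int(lam_max + 10.0 * math.sqrt(lam_max) + 20)
    tails = [1.0] * len(rest)               # tails[j] = P(Y_j >= y), starting at y = 0
    total = 0.0                             # sum_y mu(y) * sum_i P(Y_i = y)
    for y in range(y_max + 1):
        mu = sum(tails)                     # mu(y) = sum_{j > k} P(Y_j >= y)
        total += mu * sum(poisson_pmf(y, m * p) for p in top)
        # P(Y_j >= y + 1) = P(Y_j >= y) - P(Y_j = y), clamped against rounding
        tails = [max(0.0, t - poisson_pmf(y, m * p)) for t, p in zip(tails, rest)]
    return k - total / k

# Toy PPR vector: three heavy nodes and forty light ones (hypothetical).
toy_pi = [0.3, 0.2, 0.1] + [0.01] * 40
for m in (10, 100, 1000):
    print(m, expected_m1(toy_pi, k=3, m=m))
```

Note that $E(M_1)$ can be negative for very small m, since each $K'_i$ may be as large as n - k.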



Illustrating example (cont.): Let us calculate $E(M_1)$ for the top-10 basket corresponding 
to the seed node Jim Jackson (ice hockey). Using formula (20) for $m = 8 \times 10^3$; $10 \times 10^3$; 
$15 \times 10^3$ we obtain $E(M_1) = 7.75$; $9.36$; $9.53$. It took 2000 runs to move from $E(M_1) = 7.75$ 
to $E(M_1) = 9.36$, but then it needed 5000 runs to advance from $E(M_1) = 9.36$ to 
$E(M_1) = 9.53$. We see that we quickly obtain a 2-relaxation or 1-relaxation of the top-10 
basket but then we need to spend a significant amount of effort to get the complete basket. 
This is indeed in agreement with the Monte Carlo runs (see, e.g., Figure [T]). In the next 
theorem we explain this "80/20 behavior" and provide an indication for the choice of m. 
Theorem 5.1 In the Poissonised End Point Monte Carlo algorithm, if all top-k nodes receive 
at least $y = ma > 1$ visits and $\pi_{k+1} = (1 - \varepsilon)a$, where $\varepsilon > 1/y$, then 
(i) to satisfy $E(M_1) \ge (1 - \alpha)k$ it is sufficient to have 

$$\sum_{j=k+1}^{n} \frac{(m\pi_j)^y}{y!}\, e^{-m\pi_j} \Big[1 + \sum_{l=1}^{\infty} \frac{(m\pi_j)^l}{(y+1)\cdots(y+l)}\Big] \le \alpha k,$$

and 

(ii) statement (i) is always satisfied if 

$$m > 2a^{-1}\varepsilon^{-2}\,[-\log(\varepsilon \pi_{k+1} \alpha k)]. \qquad (21)$$



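Proof. (i) By definition of $M_1$, to ensure that $E(M_1) \ge (1 - \alpha)k$ it is sufficient that 
$E(K'_i \mid Y_i) \le \alpha k$ for each $Y_i \ge y$ and each $i = 1, \ldots, k$. Now, (i) follows directly since for each 
$Y_i \ge y$ we have $E(K'_i \mid Y_i) \le \mu(y)$ and by definition of $\mu(y)$ under Poissonisation we have 

$$\mu(y) = \sum_{j=k+1}^{n} \frac{(m\pi_j)^y}{y!}\, e^{-m\pi_j} \Big[1 + \sum_{l=1}^{\infty} \frac{(m\pi_j)^l}{(y+1)\cdots(y+l)}\Big]. \qquad (22)$$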



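To prove (ii), using (22) and the conditions of the theorem, we obtain: 

$$\begin{aligned}
\mu(y) &\le \sum_{j=k+1}^{n} \frac{(m\pi_j)^y}{y!}\, e^{-m\pi_j} \big[1 + (1-\varepsilon) + (1-\varepsilon)^2 + \cdots\big] \\
&= \frac{1}{\varepsilon} \sum_{j=k+1}^{n} \frac{(m\pi_j)^y}{y!}\, e^{-m\pi_j} \\
&= \frac{1}{\varepsilon} \sum_{j=k+1}^{n} \pi_j\, \frac{m}{y}\, \frac{(m\pi_j)^{y-1}}{(y-1)!}\, e^{-m\pi_j} \\
&\overset{\{1\}}{\le} \frac{1}{\varepsilon}\, \frac{m}{y}\, \frac{(m\pi_{k+1})^{y-1}}{(y-1)!}\, e^{-m\pi_{k+1}} \\
&\overset{\{2\}}{\le} \frac{1}{\varepsilon \pi_{k+1}} \Big(\frac{m\pi_{k+1}}{y}\Big)^{y} e^{\,y - m\pi_{k+1}}
= \frac{1}{\varepsilon \pi_{k+1}} \big[(1-\varepsilon)\, e^{\varepsilon}\big]^{y}. \qquad (23)
\end{aligned}$$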



Here {1} holds because $\sum_{j \ge k+1} \pi_j \le 1$ and $\frac{(m\pi_j)^{y-1}}{(y-1)!} \exp\{-m\pi_j\}$ is maximal at 
$j = k + 1$. The latter follows from the conditions of the theorem: $m\pi_{k+1} = (1 - \varepsilon)y < y - 1$ 
when $\varepsilon > 1/y$. In {2} we use that $y! \ge y^y / e^y$. 

Now, we want the last expression in (23) to be smaller than $\alpha k$. Solving for m, we get: 

$$ma\,(\log(1 - \varepsilon) + \varepsilon) < \log(\varepsilon \pi_{k+1} \alpha k).$$

Note that the expression under the logarithm on the right-hand side is always smaller than 1 
since $\alpha < 1$, $\varepsilon < 1$ and $k\pi_{k+1} < 1$. Using $\log(1 - \varepsilon) + \varepsilon = -\sum_{l=2}^{\infty} \varepsilon^l / l \le -\varepsilon^2/2$, we obtain 
(ii). 

□ 
From (i) we can already see that the 80/20 behavior of $E(M_1)$ (and, respectively, $E(M_0)$) 
can be explained mainly by the fact that $\mu(y)$ drops drastically with y because the Poisson 
probabilities decrease faster than exponentially. 

The bound in (ii) shows that m should be roughly of the order $1/\pi_k$. The term $\varepsilon^{-2}$ is 
not defining since $\varepsilon$ does not need to be small. For instance, by choosing $\varepsilon = 1/2$ we can 
filter out the nodes with Personalized PageRank not higher than $\pi_k/2$. This often may be 
sufficient in applications. Obviously, the logarithmic term is of a smaller order of magnitude. 
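
To make the order of magnitude concrete, the sketch below evaluates the right-hand side of (21) and checks that, at this value of m, the last expression in (23) indeed falls below $\alpha k$. The function names and the numerical values of a, $\varepsilon$, $\alpha$ and k are hypothetical and chosen only for illustration.

```python
import math

def walks_sufficient(a, eps, alpha, k):
    """Right-hand side of (21): 2 a^{-1} eps^{-2} [-log(eps * pi_{k+1} * alpha * k)]."""
    pi_next = (1.0 - eps) * a               # pi_{k+1} = (1 - eps) a, as in Theorem 5.1
    return 2.0 / (a * eps ** 2) * (-math.log(eps * pi_next * alpha * k))

def bound_23(a, eps, m):
    """Last expression in (23): [(1 - eps) e^eps]^y / (eps * pi_{k+1}) with y = m a."""
    y = m * a
    pi_next = (1.0 - eps) * a
    return ((1.0 - eps) * math.exp(eps)) ** y / (eps * pi_next)

a, eps, alpha, k = 0.01, 0.5, 0.1, 10       # hypothetical values
m_min = walks_sufficient(a, eps, alpha, k)
print(m_min, bound_23(a, eps, m_min), alpha * k)
```

With these values the required m is a few multiples of 1/a times the logarithmic factor, in line with the remark that m is roughly of the order $1/\pi_k$.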

We note that the bound in (ii) is very rough because in its derivation we replaced $\pi_j$, 
$j > k$, by their maximum value $\pi_{k+1}$. In realistic examples, $\mu(y)$ will be much smaller than 
the last expression in (23), which allows for m much smaller than in (21). In fact, in our 
examples good top-k lists are obtained if the algorithm is terminated at the point when, for 
some y, each node in the current top-k list has received at least y visits while the rest of the 
nodes have received at most $y - d$ visits, where d is a small number, say $d = 2$. Such a choice 
of m satisfies (i) with reasonably small $\alpha$. Without a formal justification, this stopping rule 
can be understood since, roughly, we have $m\pi_{k+1} = ma(1 - \varepsilon) \approx ma - d$, which results in 
a small value of $\mu(y)$. 
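
The stopping rule just described can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the adjacency-list graph format, the damping value c = 0.85, the treatment of dangling nodes (the walk simply ends there), the checking interval and all function names are assumptions.

```python
import random

def gap_reached(counts, k, d):
    """True if the k-th largest visit count exceeds the (k+1)-th by at least d."""
    ranked = sorted(counts.values(), reverse=True)
    if len(ranked) < k:
        return False
    runner_up = ranked[k] if len(ranked) > k else 0
    return ranked[k - 1] - runner_up >= d

def end_point_top_k(graph, seed, k, c=0.85, d=2, check_every=100,
                    max_walks=100_000, rng=None):
    """End Point Monte Carlo with the gap-based stopping rule (heuristic)."""
    rng = rng or random.Random()
    counts = {}                              # node -> number of walks ending there
    walks = 0
    while walks < max_walks:
        node = seed
        # Continue with probability c; stop at dangling nodes (assumption).
        while rng.random() < c and graph.get(node):
            node = rng.choice(graph[node])
        counts[node] = counts.get(node, 0) + 1
        walks += 1
        if walks % check_every == 0 and gap_reached(counts, k, d):
            break
    top = sorted(counts, key=counts.get, reverse=True)[:k]
    return top, counts, walks

toy_graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["a", "b", "c"]}
print(end_point_top_k(toy_graph, "a", k=2, rng=random.Random(0)))
```

Checking the gap only every `check_every` walks keeps the sorting cost small; with a very small d the rule may trigger early, which matches its heuristic nature.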

6 Application to Name Disambiguation 

In this section we apply Personalized PageRank computed using the Monte Carlo method to 
the Person Name Disambiguation problem. In the context of Web search, when a user wants to 
retrieve information about a person by his/her name, search engines typically return Web 
pages which contain the name but can refer to different persons. Indeed, person names are 
highly ambiguous: according to the US Census Bureau, approximately 90,000 names are shared 
by 100 million people. Approximately 5-10% of search queries contain a person name. To 
assist a user in finding the target person, many studies have been done, in particular within 
the WePS initiative (http://nlp.uned.es/weps/weps-3/). 

In our approach we disambiguate the referents with the help of the Web graph structure. 
We define a related page as one that addresses the same topic or one of the topics 
mentioned in a person page, that is, a page that contains the person name. Kleinberg [22] has given 
an illuminating example showing that ambiguous senses of a query can be separated on a 
query-focused subgraph. In our context, the focused subgraph is analogous to a graph of Web search 
result pages with their graph-based neighbourhood represented by forward and backward 
links. Therefore, we can expect that for an ambiguous person name query, densely linked parts 
of the subgraph form clusters corresponding to different individuals. 

The major problem in applying the HITS algorithm and other community discovery methods 
to the WePS dataset is the lack of information about the Web graph structure. 
Personalized PageRank can be used to detect related pages of the target page. Our theoretical 
and experimental results show that, quite opportunely, the Monte Carlo method is a fast way to 
approximate Personalized PageRank in a local manner, i.e., using only page forward-links. 
In such a case global backward-link neighbours are usually missing. Therefore, in general we 
cannot expect the neighbourhoods of two pages referring to one person to be interconnected. 
Nevertheless, we found it useful to examine the content of related pages. 

In the following we briefly describe our approach; further details will be published 
soon. With this approach we participated in the WePS-3 evaluation campaign. 

6.1 System Description 
Web Structure based Clustering. 

In the first stage, we cluster the person pages that appear in the search results based on the Web 
structure. To this end, we determine the related pages of each person page using Personalized 
PageRank. To avoid the negative effect of purely navigational links, the random walk of 
the Monte Carlo computation follows only links to pages whose host name differs from that 
of the current page. We estimate the top-k list of related pages for each target page. In the experiments 
we used two values of k, {8, 16}, with the damping factor c of the Personalized PageRank 
computation set to {0.2, 0.3}, respectively. 

In the following step, two Web pages that contain the name are merged into one cluster if 
they share some related pages. Since the whole link structure of the Web was unknown to 
us, the resulting Web structure clustering is limited to the local forward-link neighbourhood of 
pages. We therefore appeal to the content of the pages in the next stage. 
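
The merging step can be sketched with a union-find structure. This is an illustrative sketch only: it assumes that sharing at least one related page is enough to merge two person pages, and the function name and input format (a mapping from person page to its set of top-k related pages) are hypothetical.

```python
def cluster_by_shared_related(related):
    """Merge person pages that share at least one related page (union-find)."""
    parent = {page: page for page in related}

    def find(page):
        while parent[page] != page:
            parent[page] = parent[parent[page]]   # path halving
            page = parent[page]
        return page

    owner = {}                                    # related page -> first person page seen
    for page, rel in related.items():
        for r in rel:
            if r in owner:
                parent[find(page)] = find(owner[r])
            else:
                owner[r] = page

    clusters = {}
    for page in related:
        clusters.setdefault(find(page), set()).add(page)
    return list(clusters.values())

# Hypothetical top-k related pages for four person pages.
related = {"p1": {"x", "y"}, "p2": {"y", "z"}, "p3": {"z"}, "p4": {"w"}}
print(cluster_by_shared_related(related))
```

Pages that end up in singleton clusters, such as "p4" here, are the ones passed on to the content-based stage.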

Content based Clustering. 

In the second stage, the rest of the pages that did not show any link preference are clustered 
based on the content. With this goal in mind, we apply a preprocessing step to all pages, 
including person pages and pages related to them. Next, for each of these pages we build a 
vector of words with the corresponding frequency score (tf) in the page. After that, we use a 
re-weighting scheme as follows. The score tf(t, w) of word w at person page t is updated at 
each related page r in the following way: tf'(t, w) = tf(t, w) + tf(t, w) * tf(r, w), where r 
is a page in the set of related pages of person page t, and person page t is a page obtained from 
the search results. This step resembles a voting process. Words that appear in related pages get 
promoted and thus the scores of random words found in the person page are lowered. At the end, 
the vector is normalized and the 30 most frequent terms are taken as the person page profile. 
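
A minimal sketch of this re-weighting is given below. It assumes that the update is applied sequentially, one related page at a time, and that the final normalization is Euclidean; neither detail is specified above, and the function name and toy term frequencies are hypothetical.

```python
import math

def person_page_profile(person_tf, related_tfs, top_n=30):
    """Re-weight the term scores of a person page by its related pages."""
    scores = dict(person_tf)
    for rel in related_tfs:
        for w in scores:
            # tf'(t, w) = tf(t, w) + tf(t, w) * tf(r, w)
            scores[w] += scores[w] * rel.get(w, 0.0)
    # Euclidean normalization (assumption: the norm is not specified in the text).
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm > 0.0:
        scores = {w: v / norm for w, v in scores.items()}
    top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    return dict(top)

# Toy term frequencies (hypothetical).
profile = person_page_profile({"hockey": 2.0, "the": 2.0},
                              [{"hockey": 3.0, "nhl": 5.0}])
print(profile)
```

Words that occur only in related pages keep a zero score under this update, so the profile is always drawn from the vocabulary of the person page itself.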

Finally, we apply the hierarchical agglomerative clustering (HAC) algorithm on the basis of the 
Web structure clustering to the rest of the pages. Specifically, we use average-linkage HAC with 
the cosine measure of similarity. The clustering threshold for the HAC algorithm was determined manually. 
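
The final step can be sketched as a naive average-linkage agglomerative clustering over the page profiles, optionally seeded with the clusters from the link-based stage. The function names, the way the initial clusters are passed in and the threshold value are assumptions; the threshold used in the experiments is not reported here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def average_linkage_hac(profiles, threshold, initial_clusters=()):
    """Merge the most similar pair of clusters until similarity drops below threshold."""
    clusters = [set(c) for c in initial_clusters]
    seen = set().union(*clusters)
    clusters += [{p} for p in profiles if p not in seen]

    def linkage(a, b):
        # Average pairwise cosine similarity between members of a and b.
        return sum(cosine(profiles[p], profiles[q]) for p in a for q in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, i, j = max((linkage(clusters[i], clusters[j]), i, j)
                         for i in range(len(clusters))
                         for j in range(i + 1, len(clusters)))
        if best < threshold:
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

# Toy profiles (hypothetical).
profiles = {"a": {"hockey": 1.0, "nhl": 1.0}, "b": {"hockey": 1.0},
            "c": {"senate": 1.0}, "d": {"senate": 1.0, "vote": 1.0}}
print(average_linkage_hac(profiles, threshold=0.5))
```

This version recomputes all pairwise linkages at every step, which is adequate for the few hundred result pages of a single name query but would need caching for larger collections.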

6.2 Results 

During the evaluation period of the WePS-3 campaign we experimented with the number 
of related pages and the type of content extracted from pages. We chose to combine 
a small number of related pages with the full content of the page and, conversely, a large 
number of related pages with a small amount of extracted content. We carried out the following runs. 

PPR8 (PPR16): top 8 (16) related pages were computed using Personalized PageRank, 
the full (META tag) content of the Web page was used in the HAC step. 

Our methods achieved second place at the WePS-3 campaign. For the PPR8 (PPR16) runs 
we obtained the following metric values: 0.7 (0.71), 0.45 (0.43) and 0.5 (0.49) for 
BCubed precision (BP), BCubed recall (BR) and the harmonic mean of BP and BR (F-0.5), respectively. 
The PPR16 run showed slightly worse performance than the PPR8 run. The 
results demonstrate that our methods are promising and suggest future research on 
using Web structure for the name disambiguation problem. 